Replication Tutorial

Lars Vilhuber
April 2019

Cornell University

Overview

  • High-level overview (15:00)
  • Details of Reproducibility Checks (15:00)
  • A concrete example

Replication and Reproducibility in Social Sciences and Statistics: Context, Concerns, and Concrete Measures

Paris presentation


Details of Reproducibility Checks

Verification guidance

A concrete example

We are going to review a fully reproducible example:

  • Step 1: elements of the reproducible analysis
  • Step 2: curation of data for reproducible analysis
  • Step 3: robustness and automation

Requirements and Goals

Requirements

  • web browser
  • some R knowledge (not much)

Goals

  • show you enough of the toolkit to have you explore more
  • recognize (some) of the limitations
  • NOT make you a master of this today

Let's get started

The Census Bureau put out a blog post with data.

  • I attempted to replicate it
  • The replication itself should be replicable

The Context

the original page:

url

original page

We are going to focus on 1 figure

Original

original

Replicated

replicated

Let's start

scan

First problem

When the replicated material disappears

Consider the key inputs to this replication:

  • the original article
  • the original data
  • my article replicating the original article
  • the data for my article

stacks

Safeguarding scientific output

The role of journals is to provide a permanent record of scientific knowledge.

  • how reliable is that record?
  • where are journals stored?
  • what if the information is not in a journal?

old library

Safeguarding scientific output

  • journals disappear, as do websites
  • paper journals are stored in libraries
  • e-journals in a system called LOCKSS = Lots of Copies Keep Stuff Safe
  • data should be stored in repositories

tree in library

Solving the first snag


Building a replicable document


Why would you do this

  • lay out all the steps as “literate programming”
  • can serve as the “README”!
  • ideally runs automatically

Why would you not do this

  • in general, support for citations is weak or tricky
  • in general, not suggested when running counter to other best practices
    • becomes tricky when long-running computing is involved
    • runs counter to “short, focussed programs doing one thing” rule
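One common way to soften the long-running-computation problem is to cache expensive results on disk, so that re-knitting the document reuses them instead of recomputing. The following is only a minimal sketch: the helper name `cached`, the file name `slow_step.rds`, and the stand-in computation are all hypothetical and not part of the original tutorial.

```r
# Return the cached result at `path` if it exists; otherwise run
# `compute()`, store its result at `path`, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Hypothetical stand-in for a long-running step.
slow_step <- function() {
  Sys.sleep(0.1)
  sum(1:100)
}

cache_file <- file.path(tempdir(), "slow_step.rds")
first  <- cached(cache_file, slow_step)  # computed (unless a cache already exists)
second <- cached(cache_file, slow_step)  # read back from disk
```

The trade-off is that a stale cache can silently hide changes in the code that produced it, so the cached file should be deleted whenever the computation changes.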

Tools for a replicable document

a place to store it

  • Dropbox?
  • Github? Gitlab? Bitbucket?

a place to compute it

  • your laptop?
  • my laptop?
  • a university server?
  • a cloud server?
  • all of the above?

a programming language

  • R
  • Stata
  • Python
  • SPSS

a format for the text

  • Word?
  • \( \LaTeX \)
  • Markdown?

Tools for a replicable document

a place to store it

  • Dropbox?
  • Github! Gitlab! Bitbucket?

a place to compute it

  • your laptop?
  • my laptop?
  • a university server?
  • a cloud server!
  • all of the above!

a programming language

  • R (but don't worry!)
  • Stata
  • Python
  • SPSS

a format for the text

  • Word?
  • \( \LaTeX \)
  • Markdown!

Aside: Markdown

a format for the text

  • Word?
  • \( \LaTeX \)
  • Markdown
  • \( \overline{x} = \frac{1}{N}\sum_{i=1}^N x_i \)

Looks like this

## a format for the text
  - Word?
  - $\LaTeX$
  - **Markdown**
  - $\overline{x} = \frac{1}{N}\sum_{i=1}^N x_i$

Let's start... again

scan

The replicable document

A bit confusing - stay with me

In order to better support this tutorial, I “cloned” my main code repository to a new location:

  • from https://github.com/larsvilhuber/jobcreationblog
  • to https://gitlab.com/larsvilhuber/jobcreationblog

and then froze it.

Gitlab? Github? Git? What's up with that?


Both GitLab (and GitLab.com) and GitHub (and GitHub.com) are products that provide Git repository hosting services. [1]

(Optional) Bonus points

If you want to save any modifications you will make, you need to first fork my repository yourself:

  • use the “Fork” button
  • you could also use the “Import project” (Gitlab.com) functionality

and then substitute your own username for every occurrence of “larsvilhuber” in the Git URLs shown in the tutorial from here on.

Getting our hands dirty

Rather than squint at code on the screen, let's … replicate my replication. Online. Now.

Rstudio.cloud

Logging on to the cloud server

Rstudio.cloud login

Rstudio.cloud workspace

While you do that

Other cloud-based compute environments:

Rstudio.cloud

  • R-focused

MyBinder.org

  • Origins with Jupyter
  • Julia, Python, and R
  • different approach

https://codeocean.com

  • Software-agnostic
    • R
    • Python
    • Stata !
    • Matlab !
    • others
  • but always scripted
  • integrated versioning of the entire compute capsule

Creating a new project

Rstudio.cloud workspace

Rstudio.cloud new project

Rstudio.cloud new project from Github

Creating a new project from Gitlab

Rstudio.cloud new project from Gitlab


Notes

You could have done the same thing on your laptop

  • you might not have (the same version of) Rstudio installed (free)
  • you might not have (the same version of) R installed (free)
  • you might have a Mac, Windows, or Linux machine, old or brand new

All of these are issues affecting computational reproducibility

A cloud environment sidesteps these, but it does not solve everything…
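Whichever machine is used, it helps to record the environment alongside the results, so that a later replicator can see what differed. A minimal sketch using only base R follows; the name `env_report` and the choice of which facts to record are illustrative assumptions, not something the tutorial prescribes.

```r
# Collect the facts about the current environment that most often
# differ between machines.
env_report <- list(
  r_version = R.version.string,        # full R version string
  platform  = R.version$platform,      # CPU/OS triplet R was built for
  os        = Sys.info()[["sysname"]], # operating system name
  # versions of two packages that ship with every R installation
  base_pkgs = vapply(
    c("stats", "utils"),
    function(p) as.character(packageVersion(p)),
    character(1)
  )
)

str(env_report)
```

For a fuller record, `sessionInfo()` also lists every attached package and its version.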

Open the README document

scan

A (solved) problem of dependencies

scan

Issues of dependencies (new)

You could have done the same thing on your laptop

  • you might not have (the same version of) Rstudio installed (free)
  • you might not have (the same version of) R installed (free)
  • you might have a Mac, Windows, or Linux machine, old or brand new
  • you might not have (the same version of) packages installed

Rstudio solves that for you

Go ahead, click on “install”

scan

Solving dependencies

The problem is not just in R:

  • SSC or Stata Journal packages in Stata
  • libraries or compilers in Fortran
  • Modules (paid!) in SPSS or SAS
  • packages in Python (and versions of Python!)

XKCD 1987

Solving dependencies (R)

  • use packrat or checkpoint functionality
  • declare dependencies explicitly [1]
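####################################
# global libraries used everywhere #
####################################
# Package lock in - optional
# Install from a dated CRAN snapshot so that package versions stay fixed.
# (MRAN was retired in 2023; if this URL no longer resolves, point it at
# another dated CRAN snapshot service.)
MRAN.snapshot <- "2019-01-01"
options(repos = c(CRAN = paste0("https://mran.revolutionanalytics.com/snapshot/", MRAN.snapshot)))

# Load package x, installing it (with dependencies) first if it is missing.
pkgTest <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, dependencies = TRUE)
    if (!require(x, character.only = TRUE)) stop("Package not found: ", x)
  }
  return("OK")
}

# Declare every package the project needs in one place.
global.libraries <- c("dplyr", "devtools", "rprojroot", "tictoc")
results <- sapply(as.list(global.libraries), pkgTest)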
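Installing a package does not guarantee that the installed version is the one the analysis was written against. As a complementary sketch, the snippet below checks installed versions against declared minimums. The helper `check_versions` and the pinned versions are hypothetical; `stats` and `utils` are used only because they ship with every R installation.

```r
# Hypothetical minimum versions; in a real project these would be the
# versions the analysis was developed with.
pinned <- c(stats = "3.0.0", utils = "3.0.0")

# For each named package, report whether it is available and at least
# as new as the pinned version.
check_versions <- function(pinned) {
  vapply(
    names(pinned),
    function(p) {
      requireNamespace(p, quietly = TRUE) &&
        packageVersion(p) >= pinned[[p]]
    },
    logical(1)
  )
}

status <- check_versions(pinned)
if (!all(status)) {
  warning("Outdated or missing packages: ",
          paste(names(status)[!status], collapse = ", "))
}
```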

Solving dependencies (Stata)

  • install packages locally [1]
  • commit as part of the repository
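// Make a path local to the project
// Also see my related config.do at 
//   https://gist.github.com/larsvilhuber/6bcf4ff820285a1f1b9cfff2c81ca02b

// Root of the project: adjust to your own location
local pwd "/c/path/to/project" 
capture mkdir "`pwd'/ado"

// Redirect Stata's ado search paths into the project directory
sysdir set PERSONAL "`pwd'/ado/personal"
sysdir set PLUS     "`pwd'/ado/plus"
sysdir set SITE     "`pwd'/ado/site"

/* Now install them */
/*--- SSC packages ---*/
// esttab is distributed as part of the estout package; someprog is a placeholder
foreach pkg in outreg estout someprog {
  ssc install `pkg'
}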

Packages installed?

Click on “Knit”

Problem solved?

Not quite

scan

Another problem (maybe)

Enable popups for this site:

scan

Problem solved NOW?

You should have seen a pop-up window with the compiled text

  • do the graphs look the same?
  • does the text look the same?

Success!

Question:

Are we done?

Not quite…

Important

  • how permanent is my document?
  • how permanent is the data we are using?

Useful

  • how can others easily see my latest version?

Making the document more permanent


  • we could have started on the Open Science Framework (possibly)

OSF

  • we could create a PDF and store it on Cornell's eCommons

  • we could submit to a journal!

We are going to use Zenodo

zenodo

Zenodo is a general-purpose repository managed by CERN

CERN

Why Zenodo?

Because it makes it really easy

  • create a hook from Zenodo to Github
  • create a release on Github
  • a permanent record remains on Zenodo with a DOI
    • even if you delete your Github repo!

For more info, see https://guides.github.com/activities/citable-code/

Zenodo page

Making the page more accessible


Initially, you saw this

replicated page

Creating a webpage from Github-hosted code

  • Go into the settings
  • Tick the box to make it visible
  • Ensure that you have HTML pages (“Github Pages” does not render R Markdown)

settings

Having Github (and some friends) create a webpage

We can go one step further

  • Have the document be created automatically when we change and commit

Challenges

  • Code needs to be replicable!
    • all the dependencies need to be solved in our code
    • won't work for paid-for software (Stata, SPSS, SAS)

How permanent is the data?

The data is obtained from a Census Bureau website.

  • The website http://www2.census.gov/ces/bds/ might be reorganized or disappear
  • The data format might change
  • The API might change
  • We only need two small chunks of code

Making the data more permanent

We used Zenodo again, but all the others are just as good!

  • We uploaded manually

zenodo

Using the permanent data

scan
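When code pulls data from a repository, it is worth checking that the file received is the file that was deposited. The sketch below downloads a file only if it is not already present and then verifies its MD5 checksum. The function names, the placeholder URL, and the demo file are assumptions for illustration; the actual file names in the Zenodo deposit are not shown here.

```r
# TRUE if the file at `path` has the expected MD5 checksum.
verify_checksum <- function(path, expected_md5) {
  actual <- unname(tools::md5sum(path))
  if (is.na(actual)) stop("File not found: ", path)
  identical(actual, expected_md5)
}

# Download `url` to `dest` unless `dest` already exists, then verify it.
fetch_once <- function(url, dest, expected_md5) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  if (!verify_checksum(dest, expected_md5)) {
    stop("Checksum mismatch for ", dest)
  }
  invisible(dest)
}

# Demo with a local stand-in file, so that nothing is downloaded.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("year,value", "2015,1"), demo_file)
recorded <- unname(tools::md5sum(demo_file))  # in practice, recorded at deposit time
fetch_once("https://example.org/not-used", demo_file, recorded)
```

Keeping the downloaded copy also means the document can be re-knit offline.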

Making code changes cautiously (branching)

If we want to incorporate the Zenodo data

We could

  • make all the changes right away
  • possibly mess up the live site or the latest version of the paper?
  • maybe annoy our co-authors?

But we used a version control system with branching!

We instead

  • created a new branch zenodo
  • made all the changes there
  • can compare the changes to the main branch
  • can consult with our co-authors before merging the changes back into the main branch
  • our live site/paper remains valid the entire time

Compare the changes: Version Control


We could then proceed to incorporate (pull or merge) the changes into the main repository:

scan

Read more about it at https://help.github.com/en/articles/about-pull-requests and https://docs.gitlab.com/ee/gitlab-basics/add-merge-request.html

Final result

The final result would

  • pull data from Zenodo
  • reliably reproduce the graph as presented today
  • use citable data (DOI = 10.5281/zenodo.2649598)
  • be citable itself (DOI = 10.5281/zenodo.400356)
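A DOI becomes a stable link once it is prefixed with the doi.org resolver. As a small sketch (the helper name `doi_url` is made up for illustration), the two DOIs above can be turned into links like this:

```r
# Build the resolver URL for a DOI; the resolver redirects to wherever
# the record currently lives.
doi_url <- function(doi) {
  stopifnot(is.character(doi), all(grepl("^10\\.", doi)))
  paste0("https://doi.org/", doi)
}

data_link  <- doi_url("10.5281/zenodo.2649598")  # the data
paper_link <- doi_url("10.5281/zenodo.400356")   # the replication itself
```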

Conclusion


Replication can be a lot of work

We've touched on

  • Replication per se
  • Replicable documents
  • Possible pitfalls of software dependencies
  • Cloud computing platforms
  • Permanence of source material (website, data) and how to solve it

project

Conclusion

We have not covered everything

… because there can be a lot more

  • High-performance computing (length, quantity, throughput)
  • Issues with commercial (paid) software (access, permanence)
  • Data that is not public-use
  • Data in a locked room

SafePODS

Thank you